Refactor/preprocessing/text normalization #21
Merged
- grouped similar functions into a submodule, which is the logical approach
- text normalization is part of pre-processing raw text
- new version uses a modular approach with an .apply() method to clean white space from text
- all keyword arguments are processed internally to build the model and run the underlying function
- 📃 updated documentation of the model and function
- 💣 removed deprecated function strip_whitespace() from method
- 💣 updated init-time optimization from the module
- 🚧 documentation and field validation pending
- added an option to remove extra words, fixes #13
- word tokenization and stop words removal are now in one modular method
- 💣 this deprecates internal nlpurify/feature/selection/nltk.py methods
- added attribute control to check stop words with the desired case folding (upper/lower), per the final string's case folding requirements
- added example in jupyter notebooks
- added preprocessing utility methods
- modularized word tokenization in stop words selection
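
The commits above describe a pattern in which keyword arguments build a model whose .apply() method does the work, plus a stop words step with extra words and case folding control. A minimal sketch of that pattern follows. Every name in it (WhiteSpaceNormalizer, normalize_whitespace, StopWordsRemover, extra_words, casefold) is hypothetical and chosen for illustration; it is not the nlpurify API, and it uses a tiny built-in word list and a regex tokenizer instead of NLTK.

```python
import re
from dataclasses import dataclass


@dataclass
class WhiteSpaceNormalizer:
    """Hypothetical model: cleans white space from a string."""

    strip: bool = True      # trim leading/trailing white space
    collapse: bool = True   # squeeze internal runs into one space

    def apply(self, text: str) -> str:
        if self.collapse:
            text = re.sub(r"\s+", " ", text)
        return text.strip() if self.strip else text


def normalize_whitespace(text: str, **kwargs) -> str:
    """Function-style wrapper: kwargs build the model, then .apply() runs."""
    return WhiteSpaceNormalizer(**kwargs).apply(text)


@dataclass
class StopWordsRemover:
    """Hypothetical model: tokenizes and drops stop words in one step."""

    # Tiny illustrative list; a real pipeline would load a full corpus.
    stopwords: frozenset = frozenset({"a", "an", "the", "is", "of"})
    extra_words: tuple = ()   # additional words to remove
    casefold: str = "lower"   # fold to "lower" or "upper" before comparing

    def __post_init__(self) -> None:
        if self.casefold not in ("lower", "upper"):
            raise ValueError("casefold must be 'lower' or 'upper'")

    def _fold(self, word: str) -> str:
        return word.lower() if self.casefold == "lower" else word.upper()

    def apply(self, text: str) -> str:
        # Fold both the block list and each token the same way, so the
        # comparison is independent of the input's casing.
        blocked = {self._fold(w) for w in (*self.stopwords, *self.extra_words)}
        tokens = re.findall(r"\w+", text)
        return " ".join(t for t in tokens if self._fold(t) not in blocked)


if __name__ == "__main__":
    print(normalize_whitespace("  hello \n\t world  "))
    print(StopWordsRemover(extra_words=("please",)).apply("Please read THE manual"))
```

Keeping the options on the model and the logic in .apply() means the function-style wrapper stays a one-liner, and invalid options can be rejected when the model is built rather than when text is processed.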
📜 Description
This PR brings the following change(s):
Fixes # (issue number)
✔️ Checks
warnings for existing functions.